Abstract — In the current era, data usually has high volume, variety,
velocity, and veracity, known as the 4 V’s of Big Data. Social media is
considered one of the main sources of Big Data: it exhibits the 4 V’s
and, in addition, has high dimensionality. To manipulate Big Data
efficiently, its dimensionality should be decreased. Dimensionality
reduction converts high-dimensional data into an expressive
representation with fewer dimensions. This research work deals with
efficient Dimension Reduction processes that reduce the original
dimension in order to improve the speed of data mining. The experiments
use the Spam-WEKA dataset, which contains Twitter user information. A
modified J48 classifier is applied to reduce the dimension of the data,
thereby increasing the accuracy of data mining. The data mining tool
WEKA is used as an API of MATLAB to generate the J48 classifiers.
Experimental results indicate a significant improvement over the
existing J48 algorithm.

Keywords —  Dimension  Reduction;  J48;  WEKA; 
MATLAB. 

I.  INTRODUCTION 

In the current era, data usually has high volume, variety, velocity,
and veracity, known as the 4 V’s of Big Data. Social media is
considered one of the main sources of Big Data: it exhibits the 4 V’s
and, in addition, has high dimensionality. To manipulate Big Data
efficiently, its dimensionality should be decreased. Dimensionality
reduction converts high-dimensional data into an expressive
representation with fewer dimensions. Reducing high-dimensional text is
hard, problem-specific, and full of tradeoffs. A reasonable starting
point is simpler text data, simpler models, and smaller vocabularies;
things can always be made more complex later to see whether that
results in better model skill. Machine learning is frequently
characterized by a singular focus on model selection. Be it logistic
regression, random forests, Bayesian methods, or artificial neural
networks, machine learning practitioners are often quick to express
their preference. The reason for this is mostly historical, though
modern third-party machine learning libraries have made the deployment
of multiple models appear nearly trivial.

Dimension reduction (DR) is a preprocessing step for reducing the
original dimensions. The aims of dimension reduction strategies are to
improve the speed and precision of data mining. The four main
strategies for DR are: Supervised Feature Selection (SFS), Unsupervised
Feature Selection (UFS), Supervised Feature Transformation (SFT), and
Unsupervised Feature Transformation (UFT). Feature selection focuses on
finding feature subsets that describe the data as well as the original
data set does, for supervised or unsupervised learning tasks [Kaur &
Chhabra, (2014)]. Unsupervised implies that there is no trainer in the
form of class labels. It is important to note that DR is only a
preprocessing stage of classification. In terms of performance, having
data of high dimensionality is problematic because (a) it can mean high
computational cost to perform learning and inference and (b) it often
leads to overfitting when learning a model, which means that the model
will perform well on the training data but poorly on test data.
Dimensionality reduction addresses both of these problems while trying
to preserve most of the relevant information in the data needed to
learn accurate, predictive models.

II.  J48  ALGORITHM 

Classification is the process of building a model of classes from a set
of records that contain class labels. The decision tree algorithm is
used to find out the way the attribute vector behaves for a number of
instances. Also, on the basis of the training instances, the classes
for newly generated instances are found. This algorithm generates the
rules for the prediction of the target variable. With the help of a
tree classification algorithm, the critical distribution of the data is
easily understandable.

J48 is an extension of ID3. The additional features of J48 are
accounting for missing values, decision tree pruning, continuous
attribute value ranges, derivation of rules, etc. In the WEKA data
mining tool, J48 is an open source Java implementation of the C4.5
algorithm. The WEKA tool provides a number of options associated with
tree pruning. In case of potential overfitting, pruning can be used as
a tool for simplifying the tree. In other algorithms the classification
is performed recursively until every single leaf is pure, that is, the
classification of the data should be as perfect as possible. This
algorithm generates the rules from which the particular identity of
that data is generated.

The objective is the progressive generalization of a decision tree
until it gains an equilibrium of flexibility and accuracy.

The following are the basic steps in the algorithm:

•  If all the instances belong to the same class, the tree represents
a leaf, so the leaf is returned labeled with that class.

•  The potential information is calculated for every attribute, given
by a test on the attribute. Then the gain in information that would
result from a test on the attribute is calculated.

•  Then the best attribute is found on the basis of the present
selection criterion, and that attribute is selected for branching.

2.1  Counting  Gain 

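This procedure uses entropy, which is a measure of the disorder of the
data. The entropy of a set of instances y with n classes, where y_i is
the subset of y belonging to class i, is calculated as

\[ \mathrm{Entropy}(y) = -\sum_{i=1}^{n} \frac{|y_i|}{|y|} \log \frac{|y_i|}{|y|} \tag{1} \]

The entropy that remains after a test on attribute j, where y_v is the
subset of y for which attribute j takes the value v, is

\[ \mathrm{Entropy}(j \mid y) = \sum_{v} \frac{|y_v|}{|y|} \, \mathrm{Entropy}(y_v) \tag{2} \]

Making Gain

\[ \mathrm{Gain}(y, j) = \mathrm{Entropy}(y) - \mathrm{Entropy}(j \mid y) \tag{3} \]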
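
The entropy and gain computation described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WEKA implementation: the function names, the attribute `has_url`, and the toy labels are hypothetical, and a base-2 logarithm is assumed. Note also that C4.5 normally ranks attributes by gain ratio; the sketch follows the plain gain described in the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after
    partitioning the instances by the attribute values."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: one categorical attribute and the class label.
has_url = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
label = ["spammer", "spammer", "non-spammer", "non-spammer",
         "non-spammer", "non-spammer", "non-spammer", "non-spammer"]

print(f"entropy = {entropy(label):.4f}")
print(f"gain    = {information_gain(has_url, label):.4f}")
```

The attribute with the largest gain would be the one selected for branching in the third step of the algorithm.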

2.2  Pruning 

Outliers make this a very significant step for the result. All data
sets contain some instances that are not well defined and that differ
from the other instances in their neighborhood. The classification is
performed on the instances of the training set and the tree is formed.
Pruning is done to decrease the classification errors produced by
specialization to the training set. Pruning is performed to generalize
the tree.

2.3  Features  of  the  Algorithm 

•  Both discrete and continuous attributes are handled by this
algorithm. A threshold value is decided by C4.5 for managing continuous
attributes. This value splits the data list into those instances that
have their attribute value below the threshold and those having a value
greater than or equal to it.

•  This algorithm also takes care of the missing values in the
training data.

•  After the tree is fully built, this algorithm prunes the tree. After
building it, C4.5 goes back through the tree and attempts to eliminate
branches that do not help in reaching the leaf nodes.
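
The threshold split for continuous attributes mentioned in the first feature can be illustrated with the following Python sketch. It is a simplified illustration rather than the C4.5 or J48 code: candidate thresholds are taken as midpoints between consecutive distinct values (an assumption of this sketch), the plain information gain is used as the criterion, and the follower counts and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split
    'value < threshold' versus 'value >= threshold' with the highest gain."""
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, total):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        threshold = (lo + hi) / 2  # midpoint candidate (assumption)
        below = [lab for _, lab in pairs[:i]]
        above = [lab for _, lab in pairs[i:]]
        gain = (base
                - (len(below) / total) * entropy(below)
                - (len(above) / total) * entropy(above))
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical follower counts for six accounts.
followers = [3, 5, 8, 120, 200, 450]
label = ["spammer", "spammer", "spammer",
         "non-spammer", "non-spammer", "non-spammer"]

threshold, gain = best_threshold(followers, label)
print(threshold, gain)
```

In this toy case a single threshold separates the two classes completely, so the gain equals the full entropy of the labels.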

III.  RELATED  WORK 

Decision tree classifiers are widely used supervised learning
approaches for data exploration. They approximate a function by
piecewise constant regions and do not require prior knowledge of the
data distribution [Mitra & Acharya, (2003)]. Decision tree models are
usually used in data mining to examine the data and induce the tree and
its rules, which are then used to make predictions [Two Crows
Corporation, (2005)]. The actual purpose of decision trees is to
categorize the data into distinct groups that generate the strongest
separation in the values of the dependent variable [Parr Rud (2001)];
they are superior at identifying segments with a desired behavior such
as activation or response, hence providing an easily interpretable
solution.

The concept of decision trees was advanced and refined over many years
by J. Ross Quinlan, starting with ID3 [Iterative Dichotomiser 3
(2001)]. A method based on this approach uses an information theoretic
measure, such as entropy, for assessing the discriminatory power of
every attribute [Mitra & Acharya (2003)]. Major decision tree
algorithms are grouped as [Mitra & Acharya (2003)]: (a) classifiers
from the machine learning community: ID3, C4.5, CART; and (b)
classifiers for large databases: SLIQ, SPRINT, SONAR, RainForest.

Weka is a very effective collection of machine learning algorithms that
eases data mining tasks. It contains tools for data preparation,
regression, classification, clustering, association rule mining, and
visualization. Weka is used in this research to implement the most
common decision tree construction algorithm, C4.5, known as J48 in
Weka. It is one of the more famous Logic Programming methods, developed
by Quinlan [Quinlan JR (1986)]: an attribute-based machine learning
algorithm that creates a decision tree from a training set of data and
uses an entropy measure to build the leaves of the tree. The C4.5
algorithm is based on ID3, with supplementary programming to address
ID3's problems.

IV.  PROPOSED  TECHNIQUE  AND  FRAMEWORK

The WEKA tool offers innovative, effective, and relatively easy-to-use
data mining and machine learning solutions. It has been developed by
the WEKA team since 1994. WEKA contains many built-in algorithms for
data mining and machine learning. It is an open source and freely
available platform. People with little knowledge of data mining can
also use this software easily, and it provides flexible facilities for
scripting experiments. As new algorithms appear in the research
literature, they are added to the software. WEKA has also gained a
reputation that makes it one of the favorite tools for data mining
research, and it has helped advance the field by making numerous
powerful features available to all.

4.1  The  following  are  steps  performed  for  data  mining 
in  WEKA: 

•  Data  pre-processing  and  visualization 

•  Attribute  selection 

•  Classification  (Decision  trees) 

•  Prediction  (Nearest  Neighbor) 

•  Model  evaluation 

•  Clustering  (Cobweb,  K-means) 

•  Association rules

4.2  J48  Improvement 


[Flow chart: MATLAB -> Load ARFF dataset -> Refine loaded dataset in
MATLAB -> Apply J48 algorithm in WEKA -> Results evaluated]

Fig. 1:  Flow Chart and Set-Up
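
To make the first two boxes of the flow chart concrete, the following Python sketch loads a small ARFF document and applies one possible refinement. It is an illustration only: the paper performs these steps in MATLAB with WEKA, the attribute names and sample rows are hypothetical, and dropping rows with missing values is an assumed example of "refining", since the paper does not specify the refinement. The parser handles only dense, unquoted ARFF.

```python
def load_arff(text):
    """Parse a small dense ARFF document into (attribute_names, rows).

    Handles only the basics: '%' comments, @attribute lines, and
    comma-separated @data rows without quoting or sparse notation.
    """
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue
        lower = line.lower()
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


def drop_incomplete(rows):
    """Assumed refinement step: discard rows with a missing value ('?')."""
    return [row for row in rows if "?" not in row]


# Hypothetical sample; the real Spam-WEKA attributes are not listed here.
SAMPLE = """\
@relation twitter_users
@attribute followers numeric
@attribute following numeric
@attribute class {spammer,non-spammer}
@data
12,950,spammer
?,40,non-spammer
800,310,non-spammer
"""

names, rows = load_arff(SAMPLE)
clean = drop_incomplete(rows)
print(names)
print(len(rows), "rows loaded,", len(clean), "rows kept")
```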


V.  EXPERIMENTATION 


This section shows the results and how performance was evaluated; the
J48 algorithm is also compared to other algorithms.

The formulas employed for calculating the accuracy are

\[ TA = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} \]

\[ RA = \frac{(TN + FP)(TN + FN) + (FN + TP)(FP + TP)}{Total \times Total} \tag{5} \]

In equation (4), TA = Total Accuracy, TP = True Positive, TN = True
Negative, FP = False Positive, and FN = False Negative. In equation
(5), RA represents Random Accuracy and Total = TP + TN + FP + FN.

Fig. 2 shows the tested negative and positive values of spammers with
respect to the various attributes. It shows the total number of
classified spammers and non-spammers in the dataset in the WEKA
environment.

Fig. 2:  Data representation by class in Weka environment


Table 1 shows the output of classification as a confusion matrix for
spammers and non-spammers.


Table 1:  Confusion matrix

a        b        classified as
2316     2684     a = spammer
720      94280    b = non-spammer
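
As a worked check, the accuracy measures of equations (4) and (5) can be evaluated on the counts of Table 1. The sketch below treats "spammer" as the positive class and uses the usual chance-agreement form for random accuracy; both choices are assumptions about what the text intends, and the function names are hypothetical.

```python
def total_accuracy(tp, tn, fp, fn):
    """Fraction of instances classified correctly (equation 4)."""
    return (tp + tn) / (tp + tn + fp + fn)


def random_accuracy(tp, tn, fp, fn):
    """Accuracy expected by chance from the row and column totals
    (assumed form of equation 5)."""
    total = tp + tn + fp + fn
    return ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (total * total)


# Counts from Table 1, with "spammer" as the positive class.
tp, fn = 2316, 2684   # actual spammers: classified as a, classified as b
fp, tn = 720, 94280   # actual non-spammers: classified as a, classified as b

ta = total_accuracy(tp, tn, fp, fn)
ra = random_accuracy(tp, tn, fp, fn)
print(f"TA = {ta:.5f}")   # 0.96596
print(f"RA = {ra:.6f}")   # 0.922676
```

The total accuracy works out to 96.596%, which coincides with the J48 row of Table 2.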

Table  2  shows  the  results  of  various  algorithms  against 
the  performance  of  the  proposed  improved  technique. 


Table 2:  Performance comparison of other algorithms

Algorithm                 Accuracy %    Error %
Naive Bayes               54.46         45.54
Multi Class Classifier    94.999        5.001
Random Tree               94.98         5.02
REP Tree                  96.347        3.653
Random Forest             96.962        3.038
J48                       96.596        3.404
Improved J48              98.607        1.393

[Bar chart of Error and Accuracy per algorithm (Improved J48, J48,
Random Forest, REPTree, Random Tree, MultiClassClassifier, NaiveBayes);
horizontal axis from 0 to 120]

Fig. 3:  Results of algorithms in percentage
Fig. 3 shows the comparison graph of the various algorithms on accuracy
and error rate. It clearly shows that the improved technique performs
better than the others, with an accuracy rate of 98.607%.

VI.  CONCLUSIONS  AND  FUTURE  TREND

This research proposes an approach for efficient prediction of spammers
from records of Twitter users. It is able to correctly predict spammers
and non-spammers with up to a 98.607% accuracy rate. The improved
technique makes use of the data mining tool WEKA, which is used
together with MATLAB for generating an improved J48 classifier. The
experimental results speak for themselves.

In the near future, more datasets will be used to validate the proposed
algorithm. Only 100000 instances were used for this research work; a
larger and more dynamic dataset should be considered in order to test
the effectiveness of this algorithm.